# Qwen 3.5 MoE — MLX Backend Support
Adds `--backend mlx` to the existing Qwen 3.5 MoE export script,
enabling export and inference on Apple Silicon via the MLX delegate.
## What changed
**Unified export** (`examples/models/qwen3_5_moe/export.py`)
- Added `--backend mlx` alongside the existing CUDA path; the CUDA path
  is unchanged.
- Added `--model-id` for automatic HuggingFace download.
- Added `--tiny-test` for CI validation with random weights (~30s, no
download).
**MLX source transformations**
(`examples/models/qwen3_5_moe/mlx_source_transformations.py`)
- Replaces Triton-dependent modules with MLX equivalents:
  - `FusedMoEExperts` → `SwitchMLP`
  - `GatedDeltaNet` → `mlx::gated_delta_rule` custom op
  - `FullAttention` → `mlx::rope`
  - `KVCache` → MLX KVCache
  - `GemmaRMSNorm` → `F.rms_norm`
  - `SparseMoE` → same module, with unnecessary dtype casts removed
**SwitchLinear / SwitchMLP** (`backends/mlx/llm/switch.py`)
- Per-expert linear using `mlx::gather_mm` / `mlx::gather_qmm` custom
ops.
- `SwitchMLP`: reusable gated MoE MLP with configurable activation and
optional gate+up fusion.
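For reference, the math a gated MoE MLP of this shape computes, written as a hedged numpy sketch (top-1 routing for clarity; the real `SwitchMLP` routes through the fused `mlx::gather_mm` / `mlx::gather_qmm` ops instead of an einsum, and its activation is configurable):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def switch_mlp(x, expert_idx, w_gate, w_up, w_down):
    """Gated MoE MLP: each token goes through its routed expert's weights.

    x:            (tokens, d_model)
    expert_idx:   (tokens,) int, chosen expert per token
    w_gate, w_up: (n_experts, d_model, d_ff)
    w_down:       (n_experts, d_ff, d_model)
    """
    g = np.einsum("td,tdf->tf", x, w_gate[expert_idx])  # gather + matmul
    u = np.einsum("td,tdf->tf", x, w_up[expert_idx])
    h = silu(g) * u                                     # gated activation
    return np.einsum("tf,tfd->td", h, w_down[expert_idx])
```

The optional gate+up fusion mentioned above amounts to concatenating `w_gate` and `w_up` so the first two matmuls become one.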
**Gated delta rule** (`backends/mlx/model_ops/gated_delta_rule.py`)
- Custom op with `mutates_args=("state",)` for recurrent state
carry-forward.
- Pattern handler emits `MetalKernelNode` (fused GPU kernel) or
`ScanNode` (fallback), selected via `use_custom_kernel` kwarg on the op.
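To make the `mutates_args=("state",)` contract concrete, here is a sequential numpy reference for one common formulation of the gated delta rule recurrence (single head; this is a semantics sketch, not the fused Metal kernel, and the exact gating/ordering in the actual op may differ):

```python
import numpy as np

def gated_delta_rule_ref(q, k, v, g, beta, state):
    """Sequential reference for the gated delta rule.

    q, k:    (T, d_k)    v: (T, d_v)    g, beta: (T,)
    state:   (d_k, d_v) recurrent matrix, mutated in place —
             mirroring mutates_args=("state",) on the custom op.
    """
    T = q.shape[0]
    out = np.empty((T, v.shape[1]))
    for t in range(T):
        state *= np.exp(g[t])                 # per-step decay gate
        pred = state.T @ k[t]                 # current prediction along k_t
        # delta rule: write back the correction toward v_t, scaled by beta_t
        state += beta[t] * np.outer(k[t], v[t] - pred)
        out[t] = state.T @ q[t]
    return out
```

Mutating `state` in place is what lets decode carry the recurrent state forward across calls without re-threading it through the graph.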
**New ops / schema**
- `mlx::gather_mm`, `mlx::gather_qmm`: fused gather + matmul for MoE
expert selection.
- `GatherMmNode`, `GatherQmmNode`, `ScanNode`, `MetalKernelNode`,
`ScatterAddNode` added to FlatBuffer schema + C++ runtime.
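Reference semantics for the MoE-routing ops, as a hedged numpy sketch (signatures are illustrative; the real fused kernels avoid materializing the gathered weight tensor, and `gather_qmm` additionally dequantizes on the fly):

```python
import numpy as np

def gather_mm_ref(x, weights, idx):
    """y[t] = x[t] @ weights[idx[t]]  — fused gather + matmul semantics.

    x:       (tokens, d_in)
    weights: (n_experts, d_in, d_out)
    idx:     (tokens,) expert index per token
    """
    return np.einsum("td,tdo->to", x, weights[idx])

def combine_topk_experts(expert_out, token_idx, probs, n_tokens):
    """Scatter-add combine for top-k routing (ScatterAddNode-style).

    expert_out: (rows, d) outputs of the gathered (token, expert) pairs
    token_idx:  (rows,) which token each row belongs to
    probs:      (rows,) router weight for that (token, expert) pair
    """
    y = np.zeros((n_tokens, expert_out.shape[1]))
    np.add.at(y, token_idx, probs[:, None] * expert_out)  # unbuffered scatter-add
    return y
```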
**Python runner** (`examples/models/qwen3_5_moe/run.py`)
- ExecuTorch pybinding runner with tokenizer support and vocab size
auto-detection from `.pte` metadata.
**CI** (`.github/workflows/mlx.yml`)
- `test-mlx-qwen35-moe`: tiny-model export + inference with a
  deterministic-output assertion and an AsType node-count check (≤ 23).
- `test_gated_delta_rule` tests added to `test-mlx` job.
## Usage
Export (downloads model automatically):

```
python export.py --model-id Qwen/Qwen3.5-35B-A3B --backend mlx --qlinear 4w --qlinear-group-size 64 --output-dir ./qwen35_moe_mlx
```

Run:

```
python -m executorch.examples.models.qwen3_5_moe.run --pte ./qwen35_moe_mlx/model.pte --tokenizer Qwen/Qwen3.5-35B-A3B --prompt "What is the capital of France?"
```

CI test (no download):

```
python export.py --tiny-test --backend mlx --qlinear 4w --output-dir /tmp/tiny
python -m executorch.examples.models.qwen3_5_moe.run --pte /tmp/tiny/model.pte --prompt-len 4 --max-new-tokens 5
```
## Further optimization ideas
* Write a chunked GDN kernel
* Turn off expert sorting in decode